Diversity and specificity of molecular functions in cyanobacterial symbionts

Cyanobacteria are globally occurring photosynthetic bacteria notable for their contribution to primary production and for the production of toxins that have detrimental ecosystem impacts. Furthermore, cyanobacteria can form mutualistic symbiotic relationships with a diverse set of eukaryotes, including land plants, aquatic plankton and fungi. Nevertheless, not all cyanobacteria are found in symbiotic associations, suggesting that symbiotic cyanobacteria have evolved specializations that facilitate host interactions. Photosynthetic capabilities, nitrogen fixation, and the production of complex biochemicals are key functions provided by host-associated cyanobacterial symbionts. To explore whether additional specializations are associated with such lifestyles in cyanobacteria, we conducted comparative phylogenomics of molecular functions and of biosynthetic gene clusters (BGCs) in 984 cyanobacterial genomes. Cyanobacteria with host-associated and symbiotic lifestyles were concentrated in the family Nostocaceae, where eight monophyletic clades correspond to specific host taxa. In agreement with previous studies, symbionts are likely to provide fixed nitrogen to their eukaryotic partners through multiple different nitrogen fixation pathways. Additionally, our analyses identified chitin-metabolising pathways in cyanobacteria associated with specific host groups, while obligate symbionts had fewer BGCs. The conservation of molecular functions and BGCs between closely related symbiotic and free-living cyanobacteria suggests that more cyanobacteria have the potential to form symbiotic relationships than is currently known.


Cyanobacterial genomes, habitat annotation and quality control
Assembled genome sequence data for 1078 species belonging to the phylum Cyanobacteria were downloaded from NCBI RefSeq in January 2023 (Supplementary Table S1). An additional 27 metagenome-assembled genomes (MAGs) taxonomically classified as cyanobacteria from lichen sources 38 were included to provide additional examples of host-associated symbionts, for a total of 1105 cyanobacterial genomes.
Wherever possible, the sampled cyanobacteria were assigned to their source habitat(s) based on available sample metadata, associated publication(s) or metadata describing the original isolation reported by culture collections. These assignments may therefore include inconsistencies in cases where supporting metadata could not be sourced or lacked sufficient detail. The habitat assignments include aquatic (e.g., freshwater, marine and man-made aquatic sources) and terrestrial (e.g., soils) environments, as well as host-associated environments. Host associations include vascular and non-vascular plants, diatoms, haptophytes, fungi, macroalgae, and marine mammals (epidermal mats). Individual host species were grouped into broad taxonomic categories including bryophytes, cycads, fruit trees (Garcinia macrophylla), diatoms, haptophytes, and lichens. Water fern (Azolla) cyanobacterial symbionts were placed in their own category. These habitat annotations were also used to group the cyanobacteria into two broader lifestyle classifications: free-living and host-associated. Cyanobacterial genomes for which no specific source habitat could be identified were excluded, leaving 1026 cyanobacterial genomes for comparative analyses.
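The habitat-to-lifestyle grouping described above can be sketched as a simple lookup followed by a filter. This is a minimal illustration only: the habitat labels, the `classify_lifestyle` and `annotate` helpers, and the genome identifiers are hypothetical and are not taken from the study's supplementary tables.

```python
# Hypothetical habitat labels; the labels actually used in the study may differ.
FREE_LIVING = {"aquatic", "terrestrial"}
HOST_ASSOCIATED = {
    "bryophyte", "cycad", "fruit tree", "diatom", "haptophyte",
    "lichen", "Azolla", "macroalgae", "marine mammal",
}


def classify_lifestyle(habitat):
    """Map a habitat label to 'free-living' or 'host-associated'.

    Returns None when the habitat is missing or not recognised.
    """
    if habitat in FREE_LIVING:
        return "free-living"
    if habitat in HOST_ASSOCIATED:
        return "host-associated"
    return None


def annotate(genomes):
    """Drop genomes without a usable habitat and attach a lifestyle label.

    `genomes` maps a genome identifier to its habitat label (or None).
    """
    kept = {}
    for genome_id, habitat in genomes.items():
        lifestyle = classify_lifestyle(habitat)
        if lifestyle is not None:
            kept[genome_id] = (habitat, lifestyle)
    return kept
```

For example, `annotate({"g1": "lichen", "g2": None})` would keep only `g1`, mirroring the exclusion of genomes with no identifiable source habitat.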

Phylogenetic tree reconstruction
Taxonomic classification of genomes and generation of marker gene alignments were conducted using GTDBtk 41 (v.2.3.0; Supplementary Table S1). Phylogenetic trees were constructed for the final high-quality set of cyanobacterial genomes using IQ-TREE 42 (v.2.2.0). The analysis used the LG + I + G4 model, as identified by the IQ-TREE ModelFinder based on the Bayesian Information Criterion (BIC). A family-level phylogenetic tree for the family Nostocaceae (n = 300), rooted with representatives of the order Elainellales, was constructed using IQ-TREE and the LG + I + G4 model determined by BIC. Phylogenetic trees were visualised using iTOL 43 (v.5).
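As a reminder of what BIC-based model selection involves, the criterion is BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of alignment sites and L the maximised likelihood; the model with the lowest BIC is preferred. The sketch below illustrates the arithmetic only. The log-likelihoods and parameter counts are invented for illustration and are not values from this study, and the helper names are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def best_model(candidates, n_sites):
    """Return the name of the candidate model with the lowest BIC.

    `candidates` maps a model name to a (log_likelihood, n_params) tuple.
    """
    return min(candidates, key=lambda name: bic(*candidates[name], n_sites))


# Illustrative (made-up) fits for two substitution models on a 500-site alignment.
example = {"LG": (-1000.0, 10), "LG+I+G4": (-950.0, 12)}
selected = best_model(example, n_sites=500)
```

Here the richer model is selected because its gain in log-likelihood outweighs the penalty for its two extra parameters.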

Genome annotation and KEGG completeness estimation
Cyanobacterial genomes were annotated with Prokka 44 (v.1.14.6) and the resulting gene predictions were functionally annotated with KofamScan 45 (v.1.3.0) to derive Kyoto Encyclopedia of Genes and Genomes (KEGG) 46 ortholog annotations (Supplementary Table S4). KofamScan predictions were used with KEGG-Decoder 47 (v.1.3) to generate a table representing molecular function completeness across samples (Supplementary Table S3). KEGG functions were classified as present using two thresholds: either > 98% complete, for a more stringent analysis of the distribution of complete functions, or > 50% complete, for a lower-stringency examination of the potential presence of molecular functions, herein referred to as indicative functions (Supplementary Fig. S2). Presence/absence matrices generated for KEGG functions were used in a phylogenetic logistic regression 48 to identify enrichment of molecular functions based on lifestyle classification at the phylum level (Supplementary Tables S5, S6) and enrichment of molecular functions in individual isolation sources in the family Nostocaceae (Supplementary Tables S7, S8). Phylogenetic logistic regressions were conducted using the phyloglm function in the R package phylolm 49, using the penalised likelihood with Firth's correction and 100 bootstraps. Responses of lifestyle classification and isolation sources were defined as significant if the p-value was less than 0.05.
Initial assessments of nitrogen fixation capabilities focused on the iron-molybdenum dependent pathway. To explore the presence and distribution of the alternative vanadium-dependent pathway, the presence of KEGG orthologs for both pathways was mapped against a phylogenetic tree of cyanobacterial genera from the family Nostocaceae that contain host-associated cyanobacterial symbionts.
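The two completeness thresholds can be illustrated with a short sketch that converts a completeness table into presence/absence matrices. It assumes completeness is expressed as a fraction between 0 and 1 and that the table is held as a nested dictionary; both are assumptions made for illustration, and the `presence_matrix` helper is hypothetical rather than part of KEGG-Decoder.

```python
COMPLETE_THRESHOLD = 0.98    # "> 98% complete": stringent, complete functions
INDICATIVE_THRESHOLD = 0.50  # "> 50% complete": indicative functions


def presence_matrix(completeness, threshold):
    """Binarise a completeness table with a strict 'greater than' cut-off.

    `completeness` maps genome -> function -> completeness fraction (0 to 1).
    Returns the same structure with 1 (present) or 0 (absent).
    """
    return {
        genome: {func: int(value > threshold) for func, value in funcs.items()}
        for genome, funcs in completeness.items()
    }


# Illustrative values only.
table = {"genome_A": {"nitrogen fixation": 1.0, "chemotaxis": 0.6}}
complete = presence_matrix(table, COMPLETE_THRESHOLD)
indicative = presence_matrix(table, INDICATIVE_THRESHOLD)
```

In this toy table, chemotaxis counts as absent under the stringent threshold but as present under the indicative one, which is the distinction the two analyses are designed to capture.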

Biosynthetic gene cluster prediction and classification
BGCs were predicted on cyanobacterial genomes using SanntiS 50 (v.0.9.1) due to its high performance on both isolate genomes and MAGs, thus providing consistent annotations across all genome types. The predictions were subsequently filtered to remove those occurring at the edges of contigs and those less than 3000 bp in length, reflecting the minimum length of BGCs observed in the MIBiG database 22. BGCs were initially classified by SanntiS into standard classes such as ribosomally synthesised and post-translationally modified peptides (RiPPs), terpenes, nonribosomal peptides, polyketides, alkaloids, saccharides, and hybrid classes, which represent BGCs that cover multiple biochemical classes (Supplementary Table S9). To detect enrichment of total and specific BGC classes in host-associated symbionts, phylogenetic linear regression was conducted at the phylum level (Supplementary Table S10) and in the family Nostocaceae (Supplementary Table S11). This was performed with the phylolm function using 100 bootstraps and a lambda model for covariance.
To expand upon the basic BGC classifications provided by SanntiS and identify diversity in potential products, predicted BGCs in cyanobacteria were clustered with a large reference set of biosynthetic gene clusters (the reference BGC collection, termed RefBGC hereafter). RefBGC includes BGC predictions from running SanntiS on MGnify 51 and RefSeq genomes 52, as well as the BGCs found in MIBiG 22, and was subsequently refined to include only complete predictions. This clustering enabled the assignment of BGCs to more specific groups based on functional domain composition, utilizing the Louvain community detection method 53 and the Sørensen-Dice similarity coefficient 54. To refine the SanntiS BGC classification assigned to each group, antiSMASH 55 (v.7.0.0) predictions were also generated for RefSeq and used to provide more specific natural product annotations, thereby combining the breadth offered by SanntiS with the accurate BGC product assignments provided by antiSMASH. Groups of BGCs containing antiSMASH predictions were retained as the final set of BGCs (Supplementary Table S12). The habitat source of each BGC group was used in phylogenetic logistic regression to identify enrichment of specialized biosynthetic gene clusters in cyanobacteria with different lifestyles (Supplementary Tables S13, S14). This was performed with the phyloglm function, maximizing the penalized likelihood with Firth's correction across 100 bootstraps. Groups found to be significantly enriched at the phylum level were used to assess phylogenetic signal in the family Nostocaceae using the D-statistic 56 with the phylo.d function in the R package caper 57 (v.1.0.2). Responses of lifestyle classification and isolation sources were defined as significant if the p-value was less than 0.05.
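The two filtering rules (contig-edge predictions and predictions shorter than 3000 bp) can be sketched as follows. The `BGC` record, the 1-based inclusive coordinates, and the definition of "edge" as touching the first or last base of the contig are assumptions made for this illustration; the study does not specify how edge predictions were identified.

```python
from dataclasses import dataclass

MIN_BGC_LENGTH = 3000  # bp, the minimum length stated in the text


@dataclass(frozen=True)
class BGC:
    """Hypothetical record for a predicted cluster (1-based, inclusive)."""
    contig_length: int
    start: int
    end: int


def on_contig_edge(bgc, margin=0):
    """True if the prediction lies within `margin` bp of either contig end.

    Assumption: with margin=0 a cluster is on the edge only when it
    includes the first or last base of the contig.
    """
    return bgc.start <= 1 + margin or bgc.end >= bgc.contig_length - margin


def keep(bgc):
    """Apply both filters: long enough and not on a contig edge."""
    length = bgc.end - bgc.start + 1
    return length >= MIN_BGC_LENGTH and not on_contig_edge(bgc)


predictions = [
    BGC(contig_length=50000, start=1000, end=5000),    # kept
    BGC(contig_length=50000, start=1, end=5000),       # touches contig start
    BGC(contig_length=50000, start=1000, end=2500),    # too short
]
filtered = [b for b in predictions if keep(b)]
```

A non-zero `margin` could be used where predictions ending close to, but not exactly at, a contig boundary should also be treated as potentially truncated.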
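The Sørensen-Dice coefficient used for clustering can be sketched on sets of functional domains: for two sets A and B it is 2|A ∩ B| / (|A| + |B|). The example below treats each BGC as a set of domain names and lists the pairs that exceed a similarity cut-off, which is the kind of weighted edge list a community detection method such as Louvain would take as input. The domain names, the 0.5 cut-off and the helper functions are illustrative assumptions; the study's actual domain encoding and parameters are not given in the text.

```python
def dice(a, b):
    """Sørensen-Dice coefficient between two sets of domain names."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def similarity_edges(bgc_domains, min_similarity=0.5):
    """List (bgc_u, bgc_v, similarity) for pairs at or above the cut-off.

    `bgc_domains` maps a BGC identifier to its set of domain names.
    """
    ids = sorted(bgc_domains)
    edges = []
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            score = dice(bgc_domains[u], bgc_domains[v])
            if score >= min_similarity:
                edges.append((u, v, score))
    return edges


# Hypothetical domain content for three predicted clusters.
clusters = {
    "bgc_a": {"KS", "AT"},
    "bgc_b": {"KS", "ACP"},
    "bgc_c": {"TPS"},
}
edges = similarity_edges(clusters)
```

In this toy input, `bgc_a` and `bgc_b` share one of their two domains each and are linked, while `bgc_c` shares nothing and stays unconnected.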

Enrichment of molecular functions and biosynthetic gene clusters in host-associated cyanobacterial symbionts
Using the taxonomic classifications based on GTDB, the cyanobacterial genomes were assigned to 18 taxonomic orders and 42 families, which were monophyletic based on the GTDBtk phylogeny, thus facilitating rigorous interpretation of the evolutionary relatedness of these organisms. Of these, Cyanobacteriales (n = 582) and PCC-6307 (representative of Cyanobium gracile; n = 262) comprised over 85% of available genome assemblies (Fig. 1A). Habitat sources were highly skewed, with aquatic environments (n = 756) representing > 75% of environmental sources for all genome assemblies. Notably, only 6% (n = 65) of assessed cyanobacterial genomes were isolated from host-associated environments, including non-vascular and vascular plants, diatoms, haptophytes, seaweeds, metazoan epidermal mats and fungi. Within this, Cyanobacteriales accounted for 93% [5.9% of host-associations in all assessed cyanobacterial genomes; n = 61] of all host-associated cyanobacterial symbiont genomes, including representatives from all detected habitat source classifications (Fig. 1B). NCBI taxonomy was also considered; however, due to challenges with nested, non-monophyletic groupings based on current taxonomic nomenclature, comparisons based on 'taxonomic identity' were not possible. Nevertheless, similar trends were shown with NCBI taxonomy, with a high proportion of genomes arising from the orders Synechococcales (n = 429) and Nostocales (n = 301), comprising nearly 75% of available reference genome assemblies, and with host-associations concentrated in the Nostocales (Supplementary Fig. S3).
KEGG functional annotations were analysed to identify molecular functions enriched in symbiont genomes by exploring the distribution of complete KEGG functions. In total, 78 complete KEGG functions were variably present across the phylum, of which 21 were significantly associated with lifestyle classification (Fig. 2A; Supplementary Fig. S4). Host-associated lifestyles were found to have a significantly higher level of occurrence
The 8837 biosynthetic gene clusters identified were classified into 124 unique groups representative of BGCs, which are likely to produce similar secondary metabolites based on similarity of the protein domain annotations. Although host-associated symbionts were found to have a significantly lower count of BGCs and classes of BGCs as a whole, individual BGC groups were found to be positively associated with cyanobacterial symbionts. Overall, 60 groups were found to be present in both free-living and host-associated cyanobacteria, 62 groups were found only in free-living cyanobacteria, and only 2 groups were found exclusively in host-associated symbionts, corresponding to a terpene in a cycad symbiont and an 'other' classification in aquatic macrophyte symbionts. Of the 61 groups found in both free-living and host-associated cyanobacteria, 25 were found to have a significantly higher prevalence in host-associated cyanobacteria (Fig. 2B; Supplementary Fig. S5), while only 7 showed a significantly decreased prevalence in host-associated symbionts (Supplementary Table S10).

Host-associated lifestyle appears non-specific with multiple origins in the Nostocaceae
Cyanobacteriales-classified cyanobacteria were recovered as a well-supported monophyletic group (Fig. 1A) and contained the majority of the symbionts analysed. Within the Cyanobacteriales, the host-associated lifestyle was found to be concentrated in the family Nostocaceae (Fig. 3A). Phylogenetic reconstruction based on marker genes from publicly available high-quality cyanobacterial genomes belonging to the family Nostocaceae revealed a family-wide distribution of host-associated growth forms (Fig. 3B). Eight monophyletic clades each corresponded to a unique host category (Supplementary Table S15) and ranged in levels of host specificity. Denoted clades I-VIII, they derive from: diatoms; Peltigeraceae lichens Solorina crocea and Peltigera malacea; the lichen Peltigera membranacea; Azolla ferns; an unspecified lichen thallus (cyanobiont culture ATCC 53789); the lichen Peltigera; Peltigeraceae lichens Collema furfuraceum, Leptogium austroamericanum, Lobaria pulmonaria, Peltigera membranacea, Peltigera aphthosa and Peltigera malacea; and Dioon cycads, respectively.
Ten cyanobacterial genomes were sourced from cycad symbioses, but only three of these were found to form a monophyletic clade. Aulosira, previously classified as Nostoc, comprised monophyletic clade VIII. These symbionts were all from a Dioon host, supporting previous reports of a monophyletic origin of endophytic cyanobacteria with this host species 37. Cyanobacteria from other cycad hosts (Cycas revoluta (n = 3), Macrozamia (n = 1), Zamia pseudoparasitica (n = 1), Encephalartos horridus (n = 1), and Euterpe edulis (n = 1)) were distributed across the phylogeny. The genomes sourced from Cycas revoluta did not form a monophyletic clade and were distributed across the Nostocaceae tree. The cyanobacterium from the Arecales palm, Euterpe edulis, was found in a clade with the cyanobacterium from Garcinia macrophylla, a dicot (Malpighiales) fruit tree.
Clade IV contained 3 of the 5 analysed Azolla cyanobionts. Notably, the cyanobiont isolated from an epiphytic growth form on Azolla was not found with other true Azolla cyanobionts.
Five of the monophyletic clades, denoted II, III, IV, V and VII, contained 66% (n = 16) of the analysed lichen cyanobionts, and their hosts were all Peltigeraceae fungi. Lichen cyanobionts most distant to the main lichen clades arose from lichens of different family lineages, including Coccocarpia palmicola (Coccocarpiaceae) and Placynthium petersii (Placynthiaceae), in more basal origins of the Nostocaceae. While all lichens observed in this analysis were of the order Peltigerales, the mycobionts of these lichens are in a different fungal family from those of the other analysed cyanolichens (Peltigeraceae), suggesting the potential for genomic diversity in cyanobionts depending on host identity.
Bryophyte cyanobionts did not form host-specific clades, but instead were often found in clades containing lichen cyanobionts or terrestrial isolates. Bryophyte cyanobionts were limited to three host species: Blasia

Host-specific molecular specialization in Nostocaceae symbionts
To identify host specialization of cyanobacterial symbionts in the family Nostocaceae, the occurrence of KEGG functions across specific isolation sources was assessed. A total of 72 complete KEGG functions were found across Nostocaceae genomes. Of these, 19 were found in 99% (n = 298; Supplementary Fig. S6) of Nostocaceae genomes, including functions of amino acid metabolism (cysteine, threonine, alanine, arginine, histidine, tyrosine, glycine, lysine, proline, serine, and tryptophan), nostoxanthin production, retinal biosynthesis, RuBisCo, starch and glycogen synthesis and degradation, copper transporters, and sulfate adenylyltransferase. An additional 17 were found in more than 90% of Nostocaceae genomes, with functions including additional amino acid metabolism, astaxanthin production, riboflavin biosynthesis, sulfolipid biosynthesis, and Type I secretion systems. Exploration of indicative functions identified the ubiquitous distribution of many additional functions present in 90% of Nostocaceae genomes, including nitrogen fixation, Sec-SRP secretion pathways, chemotaxis, and cobalamin and thiamine biosynthesis. Some of these ubiquitous functions had also been observed to be significantly enriched in host-associated genomes at the phylum level. In addition to the ubiquitous distribution of certain molecular functions, specific isolation sources were also found to be associated with the prevalence of certain molecular functions (Fig. 4; Supplementary Tables S7, S8; Supplementary Fig. S7).
Pathways for both iron-molybdenum dependent and vanadium dependent nitrogen fixation were detected in free-living and host-associated cyanobacteria in Nostocaceae genera that possess host-associated cyanobacterial symbionts (Fig. 5). The iron-molybdenum dependent nitrogen fixation pathway was much more common across the selected taxa, while the vanadium dependent pathway was more common in lichen and bryophyte symbionts and in the outlier case of an epiphytic Azolla symbiont. KEGG orthologs for the vanadium dependent pathway were found in 34% and 17% of free-living cyanobacteria in aquatic and terrestrial environments, respectively. In contrast, KEGG orthologs for the vanadium dependent pathway were found in all bryophyte symbionts and over 50% of lichen symbionts. Vanadium dependent nitrogen fixation pathways were not detected in Azolla, Garcinia macrophylla or diatom symbionts, and were rarely detected in cycads (n = 1). Notably, both fixation pathways were incomplete, with the absence of K00531 (anfG) in the iron-molybdenum dependent pathway and K22899 (vnfH) in the vanadium dependent pathway.
All 32 groups of BGCs that were shown to be significantly impacted by lifestyle classification were detected in the family Nostocaceae (Fig. 6; Supplementary Fig. S9). Of these, 28 groups had a significantly non-random evolutionary distribution, and those which had non-significant phylogenetic signal (groups 169, 198, 208 and 219) were sparsely present within this family. A total of 21 BGC groups were found to be significantly impacted by specific isolation source, with a significantly increased prevalence commonly observed in multiple terrestrial host-associated environments (e.g., cycad, lichen, bryophytes) alongside free-living terrestrial cyanobacteria.

Discussion
We have compiled and analysed a large dataset of high-quality cyanobacterial genomes to explore the distribution of taxa that are associated with eukaryotic hosts, and to investigate the biochemical diversity and commonalities that distinguish symbionts and free-living isolates. These features could be observed broadly at the phylum level in both molecular functions (as predicted through KEGG orthologs) and BGCs. Broadly, these specialized functions can be summarized into four key categories: nitrogen fixation, carbohydrate utilization, environmental communication, and mediation of biotic interactions via secondary metabolite production. We both confirm some of the current understanding of cyanobacterial symbiotic associations and identify novel features in symbiont genomes that may be unique to symbiotic associations of specific host types.
The provision of fixed nitrogen to their eukaryotic hosts is one of the key benefits of cyanobacterial symbiosis in plants 11,13,14, lichens 58,59, and planktonic marine protists 60 including diatoms 61 and haptophytes 62. We found enrichment of the iron-molybdenum dependent nitrogen fixation pathways in host-associated cyanobacterial symbionts across the phylum and ubiquitously in the family Nostocaceae, supporting this as one of the key mutualistic services. Nitrogen fixation in cyanobacteria requires iron 63 and has also been shown to require manganese in legume nodule bacterial symbionts 64,65, and we demonstrated increased occurrence of Fe-Mn transporters in host-associated cyanobacteria at the phylum level and in cycad and lichen symbionts within the family Nostocaceae. Cyanobacterial symbionts were also shown to have a co-occurring vanadium-dependent nitrogen fixation pathway, presenting an alternative pathway in conditions of low molybdenum availability. This alternative pathway has previously been reported in other cyanobacterial symbionts arising from rice 66, bryophytes 67, and lichens 68,69, supporting our detection of this pathway in lichen and bryophyte symbionts in this study. We rarely detected vanadium-dependent nitrogen fixation pathways in cycad coralloid root cyanobacterial symbionts, suggesting that this pathway is not prevalent in all cyanobacterial symbionts. Similarly, alternative nitrogen fixation pathways have not previously been detected in other root-associated bacteria of plants (e.g., Rhizobia root nodules of legumes) 70. Further investigation into the occurrence and impact of alternative nitrogen fixation in cyanobacterial symbionts is still required.
Figure 4. Distribution of significant KEGG functions impacted by isolation source in Nostocaceae genera which include host-associated cyanobacterial symbionts. KEGG functions found to be significantly impacted by specific isolation sources including host-associated symbionts from aquatic macrophytes, epidermal mats of the bottlenose dolphin, bryophytes, cycads, diatoms, lichens, a fruit tree species (Garcinia macrophylla) and the water fern, Azolla. Permission for utilizing the KEGG database was obtained from Kanehisa laboratories 46.
Carbohydrate-active enzymes including chitinase and glucoamylase were found to have a significantly higher prevalence in host-associated cyanobacterial symbionts. While cyanobacteria are known for their photoautotrophic metabolism, they are capable of heterotrophic growth utilizing carbohydrates such as glucose or xylanase 71. The heterotrophic growth of cyanobacteria has been observed in Nostoc symbionts from cycads grown in the dark 72,73. Thus, the enrichment of carbohydrate-active enzymes such as glucoamylase could drive the heterotrophic potential of host-associated cyanobacterial symbionts.
Specificities in carbohydrate utilization associated with host types were observed in the family Nostocaceae, with chitinase having a significantly higher prevalence in cycad symbionts. Chitin, a highly abundant polysaccharide, is a key component in the cell walls of fungi 74,75 and may serve as a source of nitrogen for cyanobacterial and algal growth 74 or as a target for antifungal activity, which may prove advantageous to the host. The presence of carbohydrate utilization genes in bacteria is related to the habitats they are isolated from, with enrichment of carbohydrate metabolism correlated with the carbohydrate composition of the environment 76. The potential for microbes to target the fungal cell wall to prevent pathogenic fungal infection of plant hosts 75 suggests a potential additional mutualistic benefit of the cyanobacterial symbionts found in cycads. The relative absence of chitinase activity loci in lichen symbionts suggests a potential selection against antifungal activity and a key difference between fungal and plant-cyanobacterial symbioses. While the other enriched carbohydrate-active enzymes observed at the phylum level were not found to be enriched in specific host types, it will be interesting to explore in more detail the trends in distribution of carbohydrate-active enzymes in cyanobacteria, to align these results with patterns previously reported across the prokaryotic tree of life 76 and with specificities of carbohydrate enzyme activity for host-specific tissues, such as the presence of lichenases in Trebouxia (chlorophyte algae) in lichen symbionts 77.
With the exception of the water fern, Azolla 13, the majority of cyanobacterial symbionts are not permanently associated with the host. Thus, cyanobacterial symbionts require the ability to sense and locate hosts. This may be achieved through chemotaxis involving signal transduction pathways in response to chemical attractants produced by plants 78, and the ability to sense chemoattractants has proven to be critical in the formation of plant symbioses 78,79. Consideration of partially complete KEGG functions revealed chemotaxis to have a higher prevalence in host-associated cyanobacteria, although this difference was not significant (p = 0.084). This function was also observed across the Nostocaceae taxa, correlating with the occurrence of host-associated symbionts. The enrichment of motility functions has also been previously reported in terrestrial cyanobacteria 80. As the majority of these symbiotic associations, especially those found in terrestrial systems, are facultative for the cyanobacteria 11,13, this raises the important question of whether free-living cyanobacteria that possess these characteristics are also potential symbiotic partners and whether the diversity of symbiotically competent cyanobacteria is significantly higher than currently reported.
In addition to the ability to sense and respond to their environment, two secretion systems (Type I secretion systems and Sec-SRP) were also found to have a significantly higher likelihood of occurrence in host-associated symbionts, suggesting specialization to release products into the environment. While other secretion systems are known to be used to colonize hosts for pathogenic and symbiotic activity (e.g., Type III secretion systems transporting product directly into a eukaryotic cell) 81, Type I secretion systems are capable of transporting products to the extracellular space in a single step 82. As observed in bacteria that promote plant growth, the benefit of these microbial partners is often dependent on the secretion systems 83. However, in the case of the cyanobacterial symbionts, the questions of what beneficial and symbiotically critical compounds may be produced and released by these organisms, and how they vary depending on the eukaryotic host, remain unexplored.
One of the most notable patterns in the distribution of classes of biosynthetic gene clusters was observed in Nostocaceae symbionts of the water fern, Azolla. These symbionts consistently had a significantly lower number of total BGCs, which was paralleled in specific classes including nonribosomal peptides, nonribosomal peptide polyketides, RiPPs, terpenes, and 'other'. Cyanobacterial symbionts of Azolla represent the only currently known permanent obligate symbionts 13. As secondary metabolites, particularly terpenes, often have roles in mediating complex ecological interactions 7, the reduced BGC content in these obligate symbionts may be representative of the reduced complexity of their environment. As Azolla symbionts are permanently associated with their host, the requirement to respond to environmental stress and to mediate interactions with other organisms is reduced in comparison to cyanobacterial symbionts in facultative mutualisms, which also need to survive as free-living bacteria. Reduced numbers of RiPPs were observed in lichen symbionts. RiPPs have very diverse functions ranging from quorum sensing to antifungal and antibacterial properties 84. Metagenomic sequencing of lichens has forced a reconceptualisation of the symbiosis from a one mycobiont-one photobiont model to one that encompasses additional fungal partners and a diverse microbiome 38,85. This diversity may play a critical role in the growth of the lichen 38. That lichen cyanobionts have fewer RiPPs may reflect adaptation to coexistence in this diverse community, and is a topic worthy of deeper analysis.
In contrast to overall reduced counts of biosynthetic gene clusters, symbionts in bryophytes and fruit trees were found to have increased numbers of BGCs predicted to produce terpenes, alkaloids, nonribosomal peptides, and polyketide saccharides. These BGC systems may be responsible for important ecological interactions 24. Examination of specific unique groups of BGCs in the family Nostocaceae notably revealed that these groups occur in both free-living and host-associated cyanobacteria, and are often not restricted to individual host types. We note that this pattern contrasts with previous research suggesting niche-specific BGCs only in cycad symbionts 37. Cyanobacterial isolates from cycads have also been shown to be capable of forming symbiotic associations in laboratory conditions with mosses, mycorrhizal fungi and Gunnera (a flowering plant) 19. The ability of cyanobacterial symbiont isolates to form associations with a broad range of hosts supports our finding, from large-scale analyses of cyanobacteria and cyanobacterial symbionts, that secondary metabolite profiles reflect symbiotic competence that is not specific to individual hosts.
Previous phylogenetic reconstruction of Cyanobacteria has presented contrasting conclusions concerning the relationships of symbiotic isolates: (i) proposing clades comprising cycad, bryophyte and lichen symbionts 86; (ii) separation into clades representative of extracellular or intracellular/extracellular symbionts 11; (iii) grouping of lichen symbionts 87; or (iv) grouping of plant-associated symbionts 37. We found host-associated cyanobacteria were scattered across the phylogeny, with few monophyletic clades of symbionts, as previously reported for Nostoc isolates from lichen symbionts 36. Monophyletic clades of cyanobionts involved in symbioses were detected in isolates from diatoms, Dioon cycads, sets of Peltigeraceae lichens and the water fern, Azolla. In Nostocaceae, the basally arising host-associated samples corresponded to lichen symbionts associated with the fungal families Coccocarpiaceae and Placynthiaceae. The other Nostocaceae lichen symbionts analysed were associated with the fungal family Peltigeraceae, and were placed intermixed with free-living, Azolla-associated and bryophyte-associated isolates. As the lichen fungal partner is known to display a preference in photobiont acquisition 88,89, it may be that Coccocarpiaceae and Placynthiaceae fungi have a different range of potential partners than the Peltigeraceae. It will be highly informative to generate genomic data for additional, diverse cyanolichens.
In many cyanobacterial symbioses the symbiont may be found associated with a host or as a free-living form: these life habits are not mutually exclusive. The availability of free-living cyanobacteria in surrounding environments influences the symbiotic partners found in host associations 13,90, and free-living cyanobacteria closely related to symbiont clades may prove to be potential symbiotic partners. The increased prevalence of specific BGCs observed across both free-living cyanobacteria in terrestrial environments and symbionts found in terrestrial host-associations (e.g., lichens, cycads, bryophytes) further indicates that the diversity of cyanobacterial symbionts may be greater than has currently been observed. Future research focused on generating novel cyanobacterial genomes from additional symbiotic associations will be critical in advancing the understanding of host range and symbiont diversity in the phylum Cyanobacteria.

Figure 1. Phylogeny and distribution of host-associated lifestyles in the phylum Cyanobacteria. (A) Phylogeny generated using concatenated marker genes from GTDB-Tk of genome sequences of strains from the phylum Cyanobacteria, rooted with representatives of the sister group, Melainabacteria, with 1000 bootstraps. Branches with low bootstrap support (< 50%) are shown in red. The outer annotation track depicts the lifestyle classification to highlight host-associated cyanobacterial symbionts. The inner annotation track depicts the taxonomic order assigned by GTDB. Nostocaceae, a family in the Cyanobacteriales containing the majority of host associations, is shaded in light blue. (B) Frequency counts distributed across taxonomic orders for habitat classifications, highlighting the different host sources including vascular plants (water fern (Azolla), cycad, a fruit tree (Garcinia macrophylla), aquatic macrophytes (Hydrilla verticillata)), non-vascular plants (bryophytes), diatoms, haptophytes, macroalgae (Rhodophyta), fungi, and epidermal mats of aquatic mammals such as dolphins.

Figure 3. Distribution of host types in the order Cyanobacteriales and the origin of host associations in Nostocaceae. (A) Frequency counts distributed across taxonomic families in the order Cyanobacteriales, which includes the majority of host-associated cyanobacterial symbiont genomes, spanning a high diversity of eukaryotic hosts in the family Nostocaceae. Families with low frequency counts are displayed as an inset panel. Family names denoted with an alphabetical suffix represent groups that are not monophyletic in the GTDB reference tree or have unstable placement between releases of GTDB. (B) Cladogram of Nostocaceae generated from an alignment of marker genes, rooted with the outgroup Elainellales (n = 15), to explore the origin of host-specific association. Note that the cyanobacterial symbiont UCYN-A Candidatus Atelocyanobacterium thalassa genomes fall within the family Microcystaceae_A according to GTDB and so are outside of subsequent analysis. Genus names denoted with an alphabetical suffix represent groups that are polyphyletic in the GTDB reference tree or have been subdivided in the GTDB reference tree. Genera with host associations are highlighted, as well as a non-host-associated genus of Nostoc (Nostoc_B). Branches with increased line width and highlighted in colour represent eight monophyletic clades containing symbionts arising from single host classifications.

Figure 4. Distribution of significant KEGG functions impacted by isolation source in Nostocaceae genera which include host-associated cyanobacterial symbionts. KEGG functions found to be significantly impacted by specific isolation sources, including host-associated symbionts from aquatic macrophytes, epidermal mats of the bottlenose dolphin, bryophytes, cycads, diatoms, lichens, a fruit tree species (Garcinia macrophylla) and the water fern, Azolla. Permission for utilizing the KEGG database was obtained from Kanehisa laboratories 46.

Figure 5. Distribution of KEGG orthologs associated with nitrogen-fixation pathways in Nostocaceae genera which include host-associated cyanobacterial symbionts. Presence of KEGG orthologs associated with iron-molybdenum and vanadium dependent nitrogen-fixation pathways in host-associated Nostocaceae genera. Distribution of KEGG orthologs in both pathways is shown across the phylogenetic tree on the left-hand side. Both pathway types were partially complete, with the absence of K00531 (anfG) in the iron-molybdenum dependent pathway and K22899 (vnfH) in the vanadium dependent pathway. Permission for utilizing the KEGG database was obtained from Kanehisa laboratories 46.

Figure 6. Distribution of biosynthetic gene cluster groups impacted by isolation source in Nostocaceae genera which include host-associated cyanobacterial symbionts. The distribution of 32 BGC groups, denoted by unique numeric IDs, identified as being significantly impacted by lifestyle classification (i.e., free-living vs. host-associated in genera of Nostocaceae with host-associated cyanobacterial symbionts). Axis labels for BGC group IDs are coloured according to the consensus of BGC class per group. BGC group IDs marked with an asterisk (*) represent groups found to have a non-significant evolutionary distribution.
(B) Coefficient estimates from phylogenetic regressions for unique groups of biosynthetic gene clusters (BGCs), denoted by unique numeric identifiers, that are significantly impacted by lifestyle classification. Negative coefficient estimates indicate decreased likelihood of detection in host-associated symbionts. Axis labels for BGC group IDs are coloured according to the consensus of BGC class per group. (C) Distribution of counts of total detected biosynthetic gene clusters and classes of biosynthetic gene clusters shown to be significantly impacted by lifestyle classification across the phylum Cyanobacteria.
of functions including those of glucogenesis (p = 0.047; Est. 0.83), Fe-Mn transporter (p = 6.35e−06; Est. 1.30), glucoamylase (p = 4.58e−04; Est. 1.31), zeaxanthin diglucoside production (p = 6.07e−04; Est. 1.00), cobalt-magnesium transporters (p = 3.73e−02;